Image Similarity Search

Image Similarity - Yet Another Search And Sort

This is an experiment that came about from studying how to identify duplicate khipus. One of the insights in that study was that visual similarity worked. Somewhat. So then the question suggests itself: can you sort and search on image similarity? Answer! Of course you can!

There are a number of approaches to solve that problem. Since this is a prototype experiment, I picked an off the shelf package, ran it, and then reverse engineered the output, as shown below. The input sources to the package were the small image-quilt thumbnails used in the Khipu Field Guide. The package I used, pixplot, does a k-means clustering using the pixel “features” in the image, then projects the image onto the 2d plane, using tSNE as the dimensionality reduction algorithm. The final result, known as a raster-fairy is a jittered result that fits into a grid like format.

At the bottom of the page I show the results, and how effective that experiment was.

Reverse Engineering the Output of PixPlot

Code

import json
# Initialize plotly
plotly.offline.init_notebook_mode(connected = False)


pixplot_directory = f"{ku.project_directory()}/data/pixplot/output"
imagelist_file = f"{pixplot_directory}/data/imagelists/imagelist.json"
with open(imagelist_file) as f:
  image_data_dict = json.load(f)
 
# parse and load the data
images = image_data_dict['images']
image_names = [x.replace("_wide.jpg","") for x in images]
image_sizes = image_data_dict['cell_sizes'][0]
image_atlas = image_data_dict['atlas']
image_positions = image_atlas['positions']

Code

atlas_points = [(item[0], item[1]) for item in image_positions[0]]
named_atlas_points = [(name, position[0], position[1]) for (name, position) in  zip(image_names, atlas_points)]
named_atlas_points_df = pd.DataFrame(named_atlas_points, columns=['name', 'x', 'y'])

fig = (px.scatter(named_atlas_points_df, x="x", y="y", 
                 hover_name='name', hover_data=['x', 'y'], 
                 title=f"<b>Image Similarity with Unknown Layout</b>",
                  width=1000, height=1000)
        .update_layout(showlegend=False).update(layout_coloraxis_showscale=False).show()
      )

Not what I want.

Code

grid_file = f"{pixplot_directory}/data/layouts/grid_layout.json"
with open(grid_file) as f:
  grid_data_dict = json.load(f)

grid_points = [(item[0], item[1]) for item in grid_data_dict]
named_grid_points = [(name, position[0], position[1]) for (name, position) in  zip(image_names, grid_points)]
named_grid_points_df = pd.DataFrame(named_grid_points, columns=['name', 'x', 'y'])

fig = (px.scatter(named_grid_points_df, x="x", y="y", 
                   hover_name='name', hover_data=['x', 'y'], 
                   title=f"<b>Image Similarity with Unknown Layout</b>",
                  width=1000, height=1000)
        .update_layout(showlegend=False).update(layout_coloraxis_showscale=False).show()
      )

Also, not what I want.

Code

rasterfairy_file = f"{pixplot_directory}/data/layouts/rasterfairy_layout.json"
with open(rasterfairy_file) as f:
  rasterfairy_data_dict = json.load(f)

grid_x_space = round(1./(1-.9166666))
grid_y_space = round(1./(1-.92308))
rasterfairy_points = [(round(item[0]*grid_x_space)+12, round(item[1]*grid_y_space)+13) for item in rasterfairy_data_dict]
named_rasterfairy_points = [(name, position[0], position[1]) for (name, position) in  zip(image_names, rasterfairy_points)]
named_rasterfairy_points_df = pd.DataFrame(named_rasterfairy_points, columns=['name', 'x', 'y'])

fig = (px.scatter(named_rasterfairy_points_df, x="x", y="y", 
                   hover_name='name', hover_data=['x', 'y'], 
                   title=f"<b>Image Similarity in Projected 2D Space (using UMAP/tSNE) Followed by a Gridded Deformation</b>",
                  width=1000, height=1000)
        .update_layout(showlegend=False).update(layout_coloraxis_showscale=False).show()
      )
named_rasterfairy_points_df

	name	x	y
0	AS001	19	14
1	AS002	19	1
2	AS003	23	11
3	AS004	18	0
4	AS005	13	17
...	...	...	...
654	UR290	24	20
655	UR291A	0	7
656	UR292A	10	22
657	UR293	23	7
658	UR294	22	11

This provides an equivalent of what I need.

Building an HTML Table version of PixPlot’s RasterFairy

Code

max_x = max(named_rasterfairy_points_df.x.tolist())
max_y = max(named_rasterfairy_points_df.y.tolist())
table_dict = {(x,y):"" for x in range(max_x+1) for y in range(max_y+1)}
for row_num in range(len(named_rasterfairy_points_df)):
    record = named_rasterfairy_points_df.iloc[row_num]
    table_dict[(record['x'], record['y'])] = record['name']
#table_dict

Code

def names_in_row(y_index): return [item[1] for item in table_dict.items() if item[0][1] == y_index]
def thumbnail_image_size(aName): return image_sizes[image_names.index(aName)]

def rasterfairy_table_repr():
    table_rep = "<table style=\"position:relative; left:40px; top:200px;padding-bottom:30px;\" >"
    for row_num in range(max_y,-1,-1):
        row_names = names_in_row(row_num)
        table_rep += rasterfairy_table_row_repr(row_names)
    table_rep += "</table>"
    return table_rep

def rasterfairy_table_row_repr(khipu_names):
    row_rep = "<tr>"
    for name in khipu_names:
        row_rep += f"<td>{rasterfairy_table_cell_repr(name)}</td>"
    row_rep += "</tr>"  
    return row_rep

row_height = "60"
def rasterfairy_table_cell_repr(khipu_name):
    if khipu_name and (not khipu_name.startswith("XX")):
        tile_left,tile_top = (0 + 50, 0 + 170)
        tile_string = f"<div style=\"border: 0px solid #DDD; height: {row_height}px;text-align:center;vertical-align:middle;\">"
        tile_string += f"<div class=\"khipu_title\">{khipu_name}</div>\n"
        tile_string += f"<a target=\"_blank\" href=\"https://www.khipufieldguide.com/sketchbook/khipu/{khipu_name}_sketch.html\">" 
        image_size = thumbnail_image_size(khipu_name)
        tile_string += f"<img style=\"border: 1px solid #000;\" alt=\"{khipu_name}\" align=\"center\" width=\"{round(image_size[0]/3)}px;\" src= \"https://www.khipufieldguide.com/sketchbook/images/thumbnails/{khipu_name}_wide.jpg\"/>\n"
        tile_string += "</a></div>\n"
    else:
        tile_string = f"<div style=\"border: 0px solid #FFE; height: {row_height}px;text-align:center;vertical-align:middle;\"></div>\n"
    return tile_string

#rasterfairy_table_repr()

def make_rasterfairy_quilt_file(title_string="The Khipu Image Similarity Browser",
                                output_filename = "rasterfairy_imagequilt"):
    
    sort_string = "Sorted by Image Similarity" 
    quilt_string = rasterfairy_table_repr()

    template_file = "khipu_image_quilt_template"
    html_template_filename = f"{ku.project_directory()}/code/templates/{template_file}.html"
    
    output_dir = f"{ku.project_directory()}/notebook/fieldmarks/appendix/notebooks"
    rasterfairy_quilt_html_filename = f"{output_dir}/{output_filename}.html"
    with open(html_template_filename, 'r') as reader:
        html_template = reader.read()

    html_template = html_template.replace("____TITLE_STRING____", title_string)
    html_template = html_template.replace("____SORTED_BY____", sort_string)
    html_template = html_template.replace("____QUILT_TABLE____", quilt_string)
    with open(rasterfairy_quilt_html_filename, 'w') as f: f.write(html_template)
    
    os.system(f"open -a 'Brave Browser' '{rasterfairy_quilt_html_filename}'")
    
make_rasterfairy_quilt_file()

The RasterFairy Quilt

The quilt deforms the projection so that every khipu occupies a unique spot on a grid, the eponymous raster-fairy layout. Interestingly, it appears that the first level sort is very simple - width vs height. From there on it uses additional pixel level features (edge detection, etc.) to sort the images by closeness.

Evaluation of How well the RasterFairy did.

The following pairs were identified in a previous appendix as duplicate khipus, by Manuel Medrano. We can compute the distance for the grid points for each pair to see how well the algorithm did at grouping duplicates.

UR035/AS70
UR043/AS30
UR083/AS208
UR126/UR115
UR133/UR036
UR163/AS056
UR176/LL01
UR236/AS181
UR281/AS068
HP041/AS046

As a baselines, let’s use the KFG Similarity index, created by a Hierarchical Clustering approach on textual documents describing the khipu.

Code

matching_pairs = [['UR035', 'AS070'], 
                  ['UR043', 'AS030'], 
                  ['UR083', 'AS208'], 
                  ['UR126', 'UR115'], 
                  ['UR133', 'UR036'], 
                  ['UR163', 'AS056'], 
                  ['UR176', 'LL01'], 
                  ['UR236', 'AS181'], 
                  ['UR281', 'AS068'], 
                  ['HP041', 'AS046']]


khipu_summary_df = kq.fetch_khipu_summary()
khipu_similarity = {x[0]:x[1] for x in zip(khipu_summary_df['name'], khipu_summary_df['hierarchical_cluster_pos'])}

print(ku.box_text_msg("HIERARCHICAL CLUSTERING PAIR DISTANCES"))
hc_distances = []
for pair in matching_pairs:
    (pos1,pos2) = (khipu_similarity[pair[0]], khipu_similarity[pair[1]])
    pair_distance = abs(pos1 - pos2)
    hc_distances.append(pair_distance)
    print(f"{pair}: {pair_distance}")
print(f"-------------------------\nMean Hierarchical Cluster distance is {statistics.mean(hc_distances)}")

+------------------------------------------------------------------------------+
|HIERARCHICAL CLUSTERING PAIR DISTANCES                                        |
+------------------------------------------------------------------------------+
['UR035', 'AS070']: 1
['UR043', 'AS030']: 68
['UR083', 'AS208']: 134
['UR126', 'UR115']: 1
['UR133', 'UR036']: 1
['UR163', 'AS056']: 297
['UR176', 'LL01']: 1
['UR236', 'AS181']: 310
['UR281', 'AS068']: 57
['HP041', 'AS046']: 122
-------------------------
Mean Hierarchical Cluster distance is 99.2

Now let’s use visual grid matching and see how that changes the mean distance:

Code

# Use this color for later...
plotly_color_dict = {name:"#ffffaa" for name in image_names}
matching_list = []
for pair in matching_pairs:
    cell_color = ku.cycled_color_12()
    plotly_color_dict[pair[0]] = cell_color
    plotly_color_dict[pair[1]] = cell_color
    matching_list.append(pair[0])
    matching_list.append(pair[1])
matching_list = set(matching_list)

grid_position_dict = {}
for row_num in range(len(named_rasterfairy_points_df)):
    record = named_rasterfairy_points_df.iloc[row_num]
    grid_position_dict[record['name']] = (record['x'], record['y'])

print(ku.box_text_msg("RASTERFAIRY GRID PAIR DISTANCES"))
grid_distances = []
for pair in matching_pairs:
    (pos1,pos2) = (grid_position_dict[pair[0]], grid_position_dict[pair[1]])
    dx, dy = (pos2[0]-pos1[0], pos2[1]-pos1[1])
    pair_distance = round(math.sqrt(dx*dx + dy*dy))
    grid_distances.append(pair_distance)
    print(f"{pair}: {pair_distance}")
print(f"Mean grid distance is {statistics.mean(grid_distances)}")


named_grid_points = [(name, plotly_color_dict[name], 5 if name in matching_list else 1, position[0], position[1]) for (name, position) in grid_position_dict.items()]
named_grid_points_df = pd.DataFrame(named_grid_points, columns=['name', 'color', 'match_size', 'x', 'y'])
point_colors = [plotly_color_dict[name] for name in image_names]
fig = (px.scatter(named_grid_points_df, x="x", y="y", color="color", size="match_size",
                  hover_name='name', hover_data=['x', 'y'], 
                  title=f"<b>Image Similarity in Projected 2D Space (using RasterFairy Grid) - Big Circles are Matching Pairs</b>",
                  width=1000, height=1000)
        .update_layout(showlegend=False, paper_bgcolor='#fffff8')
        .update(layout_coloraxis_showscale=False)
        .show()
      )

+------------------------------------------------------------------------------+
|RASTERFAIRY GRID PAIR DISTANCES                                               |
+------------------------------------------------------------------------------+
['UR035', 'AS070']: 8
['UR043', 'AS030']: 11
['UR083', 'AS208']: 1
['UR126', 'UR115']: 1
['UR133', 'UR036']: 2
['UR163', 'AS056']: 1
['UR176', 'LL01']: 4
['UR236', 'AS181']: 5
['UR281', 'AS068']: 1
['HP041', 'AS046']: 14
Mean grid distance is 4.8

A major improvement! A good algorithm but not a great one…

Let’s use the original tSNE/UMAP data instead of the grid and see if that improves it..

Code

# Use this color for later...
plotly_color_dict = {name:"#ffffaa" for name in image_names}
matching_list = []
for pair in matching_pairs:
    cell_color = ku.cycled_color_12()
    plotly_color_dict[pair[0]] = cell_color
    plotly_color_dict[pair[1]] = cell_color
    matching_list.append(pair[0])
    matching_list.append(pair[1])
matching_list = set(matching_list)

umap_file = f"{pixplot_directory}/data/layouts/umap_jittered_layout.json"
with open(umap_file) as f:
  umap_data_dict = json.load(f)

grid_x_space = round(1./(1-.9166666))
grid_y_space = round(1./(1-.9166666))
umap_points = [(round(item[0]*grid_x_space)+12, round(item[1]*grid_y_space)+12) for item in umap_data_dict]

named_umap_points = [(name, plotly_color_dict[name], 5 if name in matching_list else 1, position[0], position[1]) for (name, position) in  zip(image_names, umap_points)]
named_umap_points_df = pd.DataFrame(named_umap_points, columns=['name', 'color', 'match_size', 'x', 'y'])

tsne_position_dict = {}
for row_num in range(len(named_umap_points_df)):
    record = named_umap_points_df.iloc[row_num]
    tsne_position_dict[record['name']] = (record['x'], record['y'])

print(ku.box_text_msg("t-SNE PAIR DISTANCES"))
tsne_distances = []
for pair in matching_pairs:
    (pos1,pos2) = (tsne_position_dict[pair[0]], tsne_position_dict[pair[1]])
    dx, dy = (pos2[0]-pos1[0], pos2[1]-pos1[1])
    pair_distance = round(math.sqrt(dx*dx + dy*dy))
    tsne_distances.append(pair_distance)
    print(f"{pair}: {pair_distance}")
print(f"-------------------------\nMean tsne_distance is {statistics.mean(tsne_distances)}\n\n")
print("")

point_colors = [plotly_color_dict[name] for name in image_names]
fig = (px.scatter(named_umap_points_df, x="x", y="y", color="color", size="match_size",
                  hover_name='name', hover_data=['x', 'y'], 
                  title=f"<b>Image Similarity in Projected 2D Space (using UMAP/tSNE) - Big Circles are Matching Pairs</b>",
                  width=1000, height=1000)
        .update_layout(showlegend=False, paper_bgcolor='#fffff8')
        .update(layout_coloraxis_showscale=False)
        .show()
      )

+------------------------------------------------------------------------------+
|t-SNE PAIR DISTANCES                                                          |
+------------------------------------------------------------------------------+
['UR035', 'AS070']: 5
['UR043', 'AS030']: 8
['UR083', 'AS208']: 0
['UR126', 'UR115']: 0
['UR133', 'UR036']: 0
['UR163', 'AS056']: 0
['UR176', 'LL01']: 2
['UR236', 'AS181']: 2
['UR281', 'AS068']: 4
['HP041', 'AS046']: 10
-------------------------
Mean tsne_distance is 3.1

Unsurprisingly, it’s better. It’s still bad where the above approach is bad, but overall the 2D projection is better in accuracy than the deformed grid. It may even be better in the original unprojected space, but still bad where the above approach is bad!

In the above graph, not all khipus are visible, since some khipus lie on top of each other after the 2D tSNE projection from n-dimensional space. For example, UR083 and AS208 (the lime-green circle) lie on top of each other.

Reviewing a RasterFairy Grid of Duplicates

Let’s redraw the quilt showing the matches to see how well they match visually, and what influences their placement in the rasterfairy quilt.

Code

def rasterfairy_table_cell_repr(khipu_name):
    if khipu_name and (not khipu_name.startswith("XX")):
        tile_color = plotly_color_dict[khipu_name]
        is_matching_khipu = khipu_name in matching_list
        tile_left,tile_top = (0 + 50, 0 + 170)
        border_style = f"border:3px solid {tile_color};" if is_matching_khipu else "border:0px solid #DDD;"
        tile_string = f"<div style=\"{border_style} height:{row_height}px;text-align:center;vertical-align:middle;\">"
        tile_string += f"<div class=\"khipu_title\">{khipu_name}</div>\n"
        tile_string += f"<a target=\"_blank\" href=\"https://www.khipufieldguide.com/sketchbook/khipu/{khipu_name}_sketch.html\">" 
        image_size = thumbnail_image_size(khipu_name)
        tile_string += f"<img style=\"border: 1px solid #000;\" alt=\"{khipu_name}\" align=\"center\" width=\"{round(image_size[0]/3)}px;\" src= \"https://www.khipufieldguide.com/sketchbook/images/thumbnails/{khipu_name}_wide.jpg\"/>\n"
        tile_string += "</a></div>\n"
    else:
        tile_string = f"<div style=\"border: 0px solid #FFE; height: {row_height}px;text-align:center;vertical-align:middle;\"></div>\n"
    return tile_string

make_rasterfairy_quilt_file(title_string="The Duplicate Khipu Image Browser",
                            output_filename = "rasterfairy_duplicate_imagequilt")

The Duplicate RasterFairy Quilt